feat(bootstrap,cli): switch GPU injection to CDI where supported #495
Conversation
Use an explicit CDI device request (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) when the Docker daemon reports CDI spec directories via `GET /info` (`SystemInfo.CDISpecDirs`). This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device request (`driver="nvidia"`, `count=-1`), which relies on the NVIDIA Container Runtime hook. Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
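The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `DeviceRequest` struct and `gpu_device_request` function here are hypothetical stand-ins for whatever Docker client types the crate actually uses, and the assumption is that `count` is ignored by the daemon when explicit `device_ids` are supplied.

```rust
// Illustrative sketch only: type and function names are hypothetical,
// not the actual Docker client API used by the crate.
#[derive(Debug, PartialEq)]
struct DeviceRequest {
    driver: String,
    count: i64,
    device_ids: Vec<String>,
}

/// Build the GPU device request. `cdi_spec_dirs` would come from the
/// daemon's `GET /info` response (`SystemInfo.CDISpecDirs`).
fn gpu_device_request(cdi_spec_dirs: &[String]) -> DeviceRequest {
    if !cdi_spec_dirs.is_empty() {
        // Declarative CDI injection: the daemon resolves the device name
        // against pre-generated CDI specs.
        DeviceRequest {
            driver: "cdi".to_string(),
            count: 0, // assumed ignored when device_ids are explicit
            device_ids: vec!["nvidia.com/gpu=all".to_string()],
        }
    } else {
        // Legacy path: the NVIDIA Container Runtime hook injects devices.
        DeviceRequest {
            driver: "nvidia".to_string(),
            count: -1, // -1 conventionally means "all GPUs"
            device_ids: vec![],
        }
    }
}

fn main() {
    let cdi = gpu_device_request(&["/etc/cdi".to_string()]);
    let legacy = gpu_device_request(&[]);
    println!("cdi driver={}, legacy driver={}", cdi.driver, legacy.driver);
}
```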
The `--gpu` flag on `gateway start` now accepts an optional value:

- `--gpu` — auto-select: CDI on Docker >= 28.2.0, legacy otherwise
- `--gpu=legacy` — force the legacy nvidia `DeviceRequest` (`driver="nvidia"`)

Internally, the `gpu` bool parameter to `ensure_container` is replaced with a `device_ids` slice. `resolve_gpu_device_ids` resolves the `"auto"` sentinel to a concrete device ID list based on the Docker daemon version, keeping the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
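The resolution step could look something like the sketch below. Only the function name `resolve_gpu_device_ids`, the `"auto"` sentinel, and the 28.2.0 threshold come from the commit message; the signature, the version parsing, and the convention that an empty ID list means "use the legacy request" are illustrative assumptions.

```rust
// Sketch of deploy-time resolution; signature and parsing are assumptions,
// not the crate's actual code.

/// Parse a "major.minor.patch" Docker server version string.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.').map(|p| p.parse::<u32>().ok());
    Some((parts.next()??, parts.next()??, parts.next()??))
}

/// Resolve the "auto" sentinel to a concrete device ID list.
/// CDI has been enabled by default since Docker 28.2.0; older daemons
/// fall back to the legacy nvidia DeviceRequest (empty list here, by
/// this sketch's convention).
fn resolve_gpu_device_ids(mode: &str, daemon_version: &str) -> Vec<String> {
    match mode {
        "auto" => match parse_version(daemon_version) {
            Some(v) if v >= (28, 2, 0) => vec!["nvidia.com/gpu=all".to_string()],
            _ => vec![],
        },
        // "legacy" (or anything unrecognized) keeps the legacy path.
        _ => vec![],
    }
}

fn main() {
    println!("28.2.0 -> {:?}", resolve_gpu_device_ids("auto", "28.2.0"));
    println!("27.5.1 -> {:?}", resolve_gpu_device_ids("auto", "27.5.1"));
}
```

Keeping this in one function means the CLI only ever passes a mode string, and the daemon-version probe happens once at deploy time.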
| Flag | Behavior |
| --- | --- |
| `--gpu` | Auto-select: CDI when enabled on the daemon, `--gpus all` otherwise |
| `--gpu=legacy` | Force `--gpus all` |
This simplification looks great. I was planning to comment and iterate on this, and noticed it had already been updated.
For some extra context: one thing to keep in mind is that we won't depend on Docker in the long term (Drew is working on a VM-based deployment mode), so it's good to keep the option surface small so we don't have to deprecate/remove options in the VM.
On that note, we are thinking that removing the legacy option makes sense as well. CDI has been enabled by default since Docker 28.2 (released in May 2025). If we get reports that it's needed for some reason, it's easy to add back then, and if it's not needed, we won't have to deprecate it with the switch to the VM.
There is one more spot that needs to be updated with this change: https://github.com/NVIDIA/nv-agent-env/blob/1f2a85e873a77ebb38fb492062f9fc936617f08a/crates/openshell-cli/src/main.rs#L1104-L1110
Summary
Switch GPU device injection in cluster bootstrap to use CDI (Container Device Interface) when it is enabled in Docker (the `docker info` endpoint returns a non-empty list of CDI spec directories). When this is not the case, the existing `--gpus all` NVIDIA `DeviceRequest` path is used as a fallback. The `--gpu` flag on `gateway start` is extended to let users force the legacy injection mode.
Related Issue
Part of #398
Changes
- feat(bootstrap): Auto-select CDI (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) if CDI is enabled on the daemon; fall back to legacy `driver="nvidia"` on older daemons or when CDI spec dirs are absent
- feat(cli): `--gpu` now accepts an optional value: omit for auto-select, `--gpu=legacy` to force the legacy `--gpus all` path
- test(e2e): Extend `gateway start` help smoke test to cover `--gpu` and `--recreate` flags

Testing
- `mise run pre-commit` passes
- Unit tests added (`resolve_gpu_device_ids` coverage)

Checklist